The lockdown and social distancing measures that were brought in throughout the world to tackle COVID in 2020 have had a significant, widespread effect on crime. In this notebook, I use public London crime data on robbery and burglary to examine where this “COVID crime shift” was strongest, and whether any specific drivers or correlates can be identified.
Our findings suggest that the relative change in burglary and robbery trends during “lockdown” in April and May 2020 was heavily affected by local characteristics: areas with a high residential population saw teh sharpest decreases in burglary (likely due to a reduction in available targets) while the reduction in robberies instead seem to be driven by geographic features and proxy indicators of deprivation (potentially suggesting more available targets for robbery in communities least able to work for from).
The primary purpose of this exercise was to learn R - I’ve previously worked entirely in Python, which is more than sufficient 99% of the time, but has at times proved a blocker when I want to tackle some more experimental geospatial methods or tools geared towards the academic community. With that in mind, this is likely to be a little messy, and I’ll aim to condense my main lessons into a blog post in the future. The models are not heavily tuned (aiming to explore correlates rather than provide accurate predictions) and certain predictors are likely to correlate with each other, and as such likely do not imply direct causations.
In this exercise, I’m using three years of Metropolitan Police Service data from data.police.uk - I’ve downloaded these manually rather than using their API.
Unlike Python has the benefit of being “focused” - it’s primaryily a tool for academics/scientists, rather than a programming language. That means the eco-system is refreshingly helpful: at first glance, there is one primary IDE, one package library, and it all works. Unlike Python, most libraries are imported globally, and that works okay…mostly
# Data manipulation, transformation and visualisation
library(tidyverse)
# Nice tables
# Simple features (a standardised way to encode vector data ie. points, lines, polygons)
# Spatial objects conversion
library(sp)
library(tmap)
# Colour palettes
library(RColorBrewer)
# More colour palettes
library(viridis)
#ggplot organiastion
library(ggpubr)
library(rgdal) # input/output, projections
library(lubridate) #date and time
#random forest and metrics
library(randomForest)
require(caTools)
library(DALEX)
library(dplyr)
library(corrr)
Importing data from CSV files behaves quite similarly to Python. To build our process, we’ll start by taking one month of crime data, exploring it, and writing all our steps for automation.
test_df <- read.csv("crimes/2018-01/2018-01-metropolitan-street.csv")
test_df
Error in gregexpr(calltext, singleline, fixed = TRUE) :
regular expression is invalid UTF-8
Error in gregexpr(calltext, singleline, fixed = TRUE) :
regular expression is invalid UTF-8
Our crime is categories according to the Home Office major crime types, and like Python, we can list them all through the “unique” function. Here I’ll be focusing on robbery and burglary: two crime types that are heavily reliant on encountering victim’s in public spaces, and as such should be affected by the “COVID effect”.
unique(test_df["Crime.type"])
To avoid this getting particularly computationally intensive, let’s write a function to pull out robberies and burglaries, and assign them a specific MSOA. Then we can iterate over all our months and get monthly counts for each offence type.
subset_df <- filter(test_df, Crime.type=="Burglary" | Crime.type=="Robbery")
subset_df
Our single month of data contains 10,501 crimes.
There are R/Python quirks that will take some getting used to. While in Python, most columns can be referenced through their string name ([“name”]), R seems somewhat fussier. Other than that, the move from one to another is largely intuitive.
We now need to link this to our spatial data. We use the MSOA borders provided by MOPAC, and use the UK National Grid coordinate system. Police.uk does not use that system, so we’ll need to reproject our crime data.
lsoa_borders <- st_read("msoa_borders/MSOA_2011_London_gen_MHW.tab", crs=27700)
Reading layer `MSOA_2011_London_gen_MHW' from data source `C:\Users\andre\Dropbox\Data Projects\Covid_crime_shift\msoa_borders\MSOA_2011_London_gen_MHW.tab' using driver `MapInfo File'
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
lsoa_borders
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
First 10 features:
MSOA11CD MSOA11NM LAD11CD LAD11NM RGN11CD RGN11NM UsualRes HholdRes ComEstRes PopDen Hholds AvHholdSz
1 E02000001 City of London 001 E09000001 City of London E12000007 London 7375 7187 188 25.5 4385 1.6
2 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham E12000007 London 6775 6724 51 31.3 2713 2.5
3 10045 10033 12 46.9 3834 2.6
4 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham E12000007 London 6182 5937 245 24.8 2318 2.6
5 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham E12000007 London 8562 8562 0 72.1 3183 2.7
6 8791 8672 119 50.6 3441 2.5
7 E02000008 Barking and Dagenham 007 E09000002 Barking and Dagenham E12000007 London 11569 11564 5 81.5 4591 2.5
8 E02000009 Barking and Dagenham 008 E09000002 Barking and Dagenham E12000007 London 8395 8376 19 87.4 3212 2.6
9 E02000010 Barking and Dagenham 009 E09000002 Barking and Dagenham E12000007 London 8615 8615 0 76.8 3292 2.6
10 E02000011 Barking and Dagenham 010 E09000002 Barking and Dagenham E12000007 London 6187 6086 101 38.8 2289 2.7
geometry
1 MULTIPOLYGON (((532135.1 18...
2 MULTIPOLYGON (((548881.6 19...
3 MULTIPOLYGON (((549102.4 18...
4 MULTIPOLYGON (((551550 1873...
5 MULTIPOLYGON (((549099.6 18...
6 MULTIPOLYGON (((549819.9 18...
7 MULTIPOLYGON (((548171.4 18...
8 MULTIPOLYGON (((546855 1863...
9 MULTIPOLYGON (((549618.8 18...
10 MULTIPOLYGON (((550244.1 18...
Unlike our previous dataframes, this isn’t “tidy” (due to a slightly different format containing geographical data)
plot(lsoa_borders)
plotting the first 9 out of 12 attributes; use max.plot = 12 to plot all
Before we can link our crimes to MSOA, we’ll need to ensure identical coordinate systems, but before we do that, we’ll need to erase any missing values.
#count missing values in the longitude column
sum(is.na(subset_df["Longitude"]))
[1] 82
As such, we have 82 crimes with no geographic identifiers, which we remove from our analysis.
clean_df <- subset_df[!rowSums(is.na(subset_df["Longitude"])), ]
clean_df
We can now convert our crime data to spacial data, using our longitude and latitude coordinates - this allows us to quickly plot our data, and confirm it looks right.
subset_spatial <- st_as_sf(clean_df, coords = c("Longitude", "Latitude"),
crs = 4326, remove = FALSE)
subset_spatial
Simple feature collection with 10419 features and 12 fields
geometry type: POINT
dimension: XY
bbox: xmin: -0.492381 ymin: 51.28683 xmax: 0.273434 ymax: 51.68564
geographic CRS: WGS 84
First 10 features:
Crime.ID Month Reported.by Falls.within Longitude Latitude
1 628e0d673aa1b6a70479342a64b02884499df85b18dcd63cc9bff3cff9f704bc 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
2 f8e9db16dca534a83493198a838567aa5adc9dd56496edc2fff5bb4c62b8303e 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
3 cc34822074b130f141f16d02fdb2d500c86e22ae18324b43a3231b381af3f45c 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135554 51.58499
4 10de581c3cd0a8c9b970824cd7589d13148d63a70b3115d95ef6c24dc0bd2c3b 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
5 50ad5d2dfea24afec9e17218db62b3d29786775db1060634ae7d4a6e7cafc3ff 2018-01 Metropolitan Police Service Metropolitan Police Service 0.127794 51.58419
6 95abc6eb0b755c9250d19bbe0062fcd4a509b701964d89667401c9dc96ca257d 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
7 035cc894d732addb5009148d8e163e6360094cfe451f621348f1c0419b9cbc77 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
8 495cac920dcf9e0e4927074e8ac307f17d340f01c69e434c4a3721df017cd342 2018-01 Metropolitan Police Service Metropolitan Police Service 0.139479 51.57974
9 48234e70cbc22265ee7968da92df1ca72f83b45414cce486ec2203daa4e59fa2 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135119 51.57849
10 3f4ba20780987c37816ff34fd0f7760cf503b2e648777f84ee0104432cb01d66 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140452 51.58110
Location LSOA.code LSOA.name Crime.type Last.outcome.category Context geometry
1 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Offender sent to prison NA POINT (0.140035 51.58911)
2 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Investigation complete; no suspect identified NA POINT (0.140035 51.58911)
3 On or near Rose Lane E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (0.135554 51.58499)
4 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (0.140035 51.58911)
5 On or near Hope Close E01000028 Barking and Dagenham 001B Burglary Status update unavailable NA POINT (0.127794 51.58419)
6 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Investigation complete; no suspect identified NA POINT (0.138439 51.5785)
7 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (0.138439 51.5785)
8 On or near Yew Tree Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (0.139479 51.57974)
9 On or near Portland Close E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (0.135119 51.57849)
10 On or near Pedestrian Subway E01000030 Barking and Dagenham 001D Robbery Investigation complete; no suspect identified NA POINT (0.140452 51.5811)
plot(subset_spatial)
plotting the first 9 out of 12 attributes; use max.plot = 12 to plot all
Success! That looks faintly promising. Now, let’s figure out how to re-project.
latlong = "+init=epsg:4326"
ukgrid = "+init=epsg:27700"
subset_osgb <- st_transform(subset_spatial, ukgrid)
GDAL Message 1: +init=epsg:XXXX syntax is deprecated. It might return a CRS with a non-EPSG compliant axis order.
subset_osgb
Simple feature collection with 10419 features and 12 fields
geometry type: POINT
dimension: XY
bbox: xmin: 504499 ymin: 155908 xmax: 557677 ymax: 200168
projected CRS: OSGB 1936 / British National Grid
First 10 features:
Crime.ID Month Reported.by Falls.within Longitude Latitude
1 628e0d673aa1b6a70479342a64b02884499df85b18dcd63cc9bff3cff9f704bc 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
2 f8e9db16dca534a83493198a838567aa5adc9dd56496edc2fff5bb4c62b8303e 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
3 cc34822074b130f141f16d02fdb2d500c86e22ae18324b43a3231b381af3f45c 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135554 51.58499
4 10de581c3cd0a8c9b970824cd7589d13148d63a70b3115d95ef6c24dc0bd2c3b 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
6 95abc6eb0b755c9250d19bbe0062fcd4a509b701964d89667401c9dc96ca257d 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
7 035cc894d732addb5009148d8e163e6360094cfe451f621348f1c0419b9cbc77 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
8 495cac920dcf9e0e4927074e8ac307f17d340f01c69e434c4a3721df017cd342 2018-01 Metropolitan Police Service Metropolitan Police Service 0.139479 51.57974
9 48234e70cbc22265ee7968da92df1ca72f83b45414cce486ec2203daa4e59fa2 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135119 51.57849
10 3f4ba20780987c37816ff34fd0f7760cf503b2e648777f84ee0104432cb01d66 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140452 51.58110
Location LSOA.code LSOA.name Crime.type Last.outcome.category Context geometry
1 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Offender sent to prison NA POINT (548349 189976)
2 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Investigation complete; no suspect identified NA POINT (548349 189976)
3 On or near Rose Lane E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (548052 189507.9)
4 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (548349 189976)
5 On or near Hope Close E01000028 Barking and Dagenham 001B Burglary Status update unavailable NA POINT (547517 189404)
6 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Investigation complete; no suspect identified NA POINT (548273 188793)
7 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (548273 188793)
8 On or near Yew Tree Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable Burglary Status update unavailable NA POINT (548043 188785)
10 On or near Pedestrian Subway E01000030 Barking and Dagenham 001D Robbery Investigation complete; no suspect identified NA POINT (548404 189086)
We now run a “spatial join”: assigning an MSOA to each of our crimes, based on the MSOA border map they are within, and combining these as one table.
crime_with_msoa <- st_join(subset_osgb, lsoa_borders["MSOA11CD"])
crime_with_msoa
Simple feature collection with 10419 features and 13 fields
geometry type: POINT
dimension: XY
bbox: xmin: 504499 ymin: 155908 xmax: 557677 ymax: 200168
projected CRS: OSGB 1936 / British National Grid
First 10 features:
Crime.ID Month Reported.by Falls.within Longitude Latitude
1 628e0d673aa1b6a70479342a64b02884499df85b18dcd63cc9bff3cff9f704bc 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
2 f8e9db16dca534a83493198a838567aa5adc9dd56496edc2fff5bb4c62b8303e 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
3 cc34822074b130f141f16d02fdb2d500c86e22ae18324b43a3231b381af3f45c 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135554 51.58499
4 10de581c3cd0a8c9b970824cd7589d13148d63a70b3115d95ef6c24dc0bd2c3b 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
5 50ad5d2dfea24afec9e17218db62b3d29786775db1060634ae7d4a6e7cafc3ff 2018-01 Metropolitan Police Service Metropolitan Police Service
6 95abc6eb0b755c9250d19bbe0062fcd4a509b701964d89667401c9dc96ca257d 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
7 035cc894d732addb5009148d8e163e6360094cfe451f621348f1c0419b9cbc77 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
8 495cac920dcf9e0e4927074e8ac307f17d340f01c69e434c4a3721df017cd342 2018-01 Metropolitan Police Service Metropolitan Police Service 0.139479 51.57974
9 48234e70cbc22265ee7968da92df1ca72f83b45414cce486ec2203daa4e59fa2 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135119 51.57849
10 3f4ba20780987c37816ff34fd0f7760cf503b2e648777f84ee0104432cb01d66 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140452 51.58110
Location LSOA.code LSOA.name Crime.type Last.outcome.category Context MSOA11CD
1 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Offender sent to prison NA E02000002
2 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Investigation complete; no suspect identified NA E02000002
3 On or near Rose Lane E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA E02000002
4 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA E02000002
5 On or near Hope Close E01000028 Barking and Dagenham 001B Burglary Status update unavailable NA E02000002
6 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Investigation complete; no suspect identified NA E02000002
7 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA E02000002
8 On or near Yew Tree Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA E02000002
9 On or near Portland Close E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA E02000002
10 On or near Pedestrian Subway E01000030 Barking and Dagenham 001D Robbery Investigation complete; no suspect identified NA E02000002
geometry
1 POINT (548349 189976)
2 POINT (548349 189976)
3 POINT (548052 189507.9)
4 POINT (548349 189976)
5 POINT (547517 189404)
6 POINT (548273 188793)
7 POINT (548273 188793)
8 POINT (548341 188933)
9 POINT (548043 188785)
10 POINT (548404 189086)
That has worked: each crime is now assigned to an MSOA. We can now group our table by count of MSOA, to obtain the monthly count of
msoa_list<- crime_with_msoa %>%
group_by(MSOA11CD, Crime.type) %>%
summarize(count_by_msoa = n())
`summarise()` has grouped output by 'MSOA11CD'. You can override using the `.groups` argument.
msoa_list
Simple feature collection with 1722 features and 3 fields
geometry type: GEOMETRY
xmin: 504499 ymin: 155908 xmax: 557677 ymax: 200168
projected CRS: OSGB 1936 / British National Grid
We now have a count for each, and we can remove the geometry data before automating our process.
msoa_pivot_tibble <- as_tibble(msoa_list)
msoa_pivot_tibble <- msoa_pivot_tibble[0:3]
msoa_pivot_tibble
NA
As a final stage prior to processing the rest of our data, we will fill any missing msoa/month combinations with a “0”, to ensure consistent trends.
#creating a df with all msoa names, for robbery and burglary
msoa_zero_df_robbery <- unique(as_tibble(lsoa_borders)["MSOA11CD"])
msoa_zero_df_burglary <- unique(as_tibble(lsoa_borders)["MSOA11CD"])
#adding our crime type column
msoa_zero_df_burglary["Crime.type"] = "Burglary"
msoa_zero_df_robbery["Crime.type"] = "Robbery"
#Creating a "count" column identical to our pivot, and filling it with 0
msoa_zero_df_robbery["count_by_msoa"] = as.numeric(0)
#combine both
duplicate_concat <- rbind(msoa_zero_df_robbery, msoa_zero_df_burglary)
#add our duplicates to our original table
df_with_dups <- rbind(msoa_pivot_tibble, duplicate_concat)
dup_filters <- duplicated(df_with_dups[0:2])
#bring it all back together
monthly_df <- filter(df_with_dups, !dup_filters)
monthly_df
#select the first unique value of months in the original dataframe
month <- unique(test_df["Month"])[1,1]
monthly_df["Month"] <- month
monthly_df
NA
we now have our basic functionality to create a time series for both crime types, by MSOA - a count of each type, and the month.
I can now bring all my previous work together into a function, which we can use to automate the process.
#quick initial function to generate our MSOA borde spatial frame, to avoid it sitting in the initial frame and gobbling loads of memory.
generate_msoa_borders <- function(file){
msoa_borders <- st_read(file, crs=27700)
return(msoa_borders)
}
make_month_pivot <- function(file){
#define our CRS
latlong = "+init=epsg:4326"
ukgrid = "+init=epsg:27700"
#read our crime from the file
test_df <- read.csv(file)
#select only our target crime types
subset_df <- filter(test_df, Crime.type=="Burglary" | Crime.type=="Robbery")
#remove any rows with a long/lat coordinate
clean_df <- subset_df[!rowSums(is.na(subset_df["Longitude"])), ]
#generate a spatial df
subset_spatial <- st_as_sf(clean_df, coords = c("Longitude", "Latitude"),
crs = 4326, remove = FALSE)
#reproject to uk grid coords
subset_osgb <- st_transform(subset_spatial, ukgrid)
#spatially join to assign to an MSOA
crime_with_msoa <- st_join(subset_osgb, msoa_borders["MSOA11CD"])
#summarise by count of MSOA
msoa_list<- crime_with_msoa %>%
group_by(MSOA11CD, Crime.type) %>%
#return to a non-geographic msoa
msoa_pivot_tibble <- as_tibble(msoa_list)
msoa_pivot_tibble <- msoa_pivot_tibble[0:3]
#creating a df with all msoa names, for robbery and burglary
msoa_zero_df_robbery <- unique(as_tibble(msoa_borders)["MSOA11CD"])
msoa_zero_df_burglary <- unique(as_tibble(msoa_borders)["MSOA11CD"])
msoa_zero_df_burglary["Crime.type"] = "Burglary"
msoa_zero_df_robbery["Crime.type"] = "Robbery"
#Creating a "count" column identical to our pivot, and filling it with 0
msoa_zero_df_burglary["count_by_msoa"] = as.numeric(0)
duplicate_concat <- rbind(msoa_zero_df_robbery, msoa_zero_df_burglary)
df_with_dups <- rbind(msoa_pivot_tibble, duplicate_concat)
#creating a filter for duplicates columns, which should ignore the first instance
dup_filters <- duplicated(df_with_dups[0:2])
monthly_df <- filter(df_with_dups, !dup_filters)
month <- unique(test_df["Month"])[1,1]
monthly_df["Month"] <- month
return(monthly_df)
}
We now have a rudimentary tooling pipeline, which we can iterate over our subfolders to aggregate our data (in hindset, I probably should have explored the API options).
#create empty dataframe to bring together our data
empty_df <- tibble(
MSOA11CD = "",
Crime.type= "",
count_by_msoa= "",
Month= ""
)
#re-ingest our MSOA data
msoa_borders <- generate_msoa_borders("msoa_borders/MSOA_2011_London_gen_MHW.tab")
Our function will iterate over each file, add them to our empty table, until each is complete (and we have a single, aggregated file)
subfiles <- list.files(path = "crimes", recursive=T)
for (file in subfiles){
folder_subdir <- "crimes/"
#concatenate to get our total subfolder directory - hacky but will work here.
sub_path <- paste(folder_subdir, file, sep="")
monthly_df <- make_month_pivot(sub_path)
}
empty_df
We now have a combined dataframe of 71,848 rows, from January 2018 through December 2020.
unique(empty_df["Month"])
From a practical perspective, that was slightly more painful than I expected - I’m not sure if data-wrangling is less intuitive in R, or it’s just practice, but it’s noticeable how easy use cases are less easily accessible as tutorials (probably due to the different user base )
#saving file to CSV
#write.csv(empty_df,"msoa_crime_matrix.csv")
With our data now cleaned and aggregated, we can focus on the more interesting part - forecasting our “expected” pandemic crime, and examining how much it diverges from our “actual” crime.
empty_df <- read.csv("msoa_crime_matrix.csv")
empty_df <- empty_df[2:70848,2:5]
empty_df
Before going any further, let’s use this to explore and visualise the distribution of robbery and burglary across time and space during our “pre-pandemic” period, in March 2020 - based on London mobility indicators, this is when movement accross London began to be heavily affected, and the disruption was most notable in April
London mobility data
burglary_df<-empty_df
#add a "1" so our month can be converted to a full date
burglary_df$DateString <- paste(burglary_df$Month, "-01", sep="")
#convert to date format
burglary_df$DateClean <- ymd(burglary_df$DateString)
#filter out only burglary prior to the pandemic
burglaryExplore <- filter(burglary_df, DateClean < "2020-03-01" & Crime.type=="Burglary")
burglaryExplore
Looking at the aggregate counts of burglary across London, a visual observation suggests yearly trends (which we’ll have to consider in our forecast), which sharp peaks during the Winter months and the lowest numbers in summer (when the days are longest).
burglary_by_month <- burglaryExplore %>%
group_by(DateClean) %>%
summarize(total_burglaries = sum(count_by_msoa))
ggplot(burglary_by_month, aes(x=DateClean, y=total_burglaries)) +
geom_line()
To observe how crime counts are distributed in space, let’s map both counts by MSOA. As previously mentioned, MSOAs are designed to be comparable units, at least from a population perspective - we don’t need to produce per population rates.
burglary_by_msoa <- burglaryExplore %>%
group_by(MSOA11CD) %>%
summarize(total_burglaries = sum(count_by_msoa))
burglary_map <- left_join(lsoa_borders, burglary_by_msoa, by = "MSOA11CD")
pal <- brewer.pal(5,"BuGn")
tm_fill(col = "total_burglaries", title = "Total Burglary Count by MSOA", style="quantile", palette="BuGn") +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
robbery_df<-empty_df
robbery_df$DateString <- paste(robbery_df$Month, "-01", sep="")
robbery_df$DateClean <- ymd(robbery_df$DateString)
robberyExplore <- filter(robbery_df, DateClean < "2020-03-01" & Crime.type=="Robbery")
robbery_by_msoa <- robberyExplore %>%
group_by(MSOA11CD) %>%
summarize(total_robberies = sum(count_by_msoa))
robbery_map <- left_join(lsoa_borders, robbery_by_msoa, by = "MSOA11CD")
pal <- brewer.pal(5,"BuGn")
robbery_map <-tm_shape(robbery_map) +
tm_fill(col = "total_robberies", title = "Total Robbery Count by MSOA", style="quantile", palette="BuGn") +
tmap_arrange(burglary_map, robbery_map, nrow = 2)
We notice that robbery is noticeably more concentrated in central London, with burglary remaining quite common across the city. That said, there are also obvious spatial patterns here - these crimes are clustered in certain geographies.
We can now begin the forecasting process. To design our process, we’ll start by focusing on a single MSOA - the first in our dataset, E02000001, or the City of London.
single_msoa_df <- filter(empty_df, MSOA11CD == "E02000001" & Crime.type=="Burglary")
single_msoa_df$DateString <- paste(single_msoa_df$Month, "-01")
single_msoa_df
From a forecasting/time-series perspective, this is a very small dataset - 36 monthly observations. We will be shrinking this further to only 26 by focusing on data prior to March 2020, when the COVID crime impact is felt. This significantly limits our forecasting options, and will impact accuracy, if we treat each MSOA in isolation - we could explore some sort of Vector Autoregressive Model to limit this, but given that we’re then going to be exploring the error of all our models in aggregation, this isn’t crucial. Our focus is on models that we can accurately deploy without needing to tune each of them individually, and that can capture the seasonal trend, and generate reliable predictions on our limited dataset.
Given these limitations, I’ve opted for the Prophet algorith. While it’s more opaque than a auto-arima or VAR model, it works well with monthly data, and extracting seasonal trends. It also requires very little tuning.
As such, we’ll extract our “training set” prior to March, and start forecasting.
training_set <- filter(single_msoa_df, DateClean < "2020-03-01")
training_df <- tibble(
y=training_set$count_by_msoa
)
training_df
library(prophet)
m <- prophet(training_df)
For now, we’ll forecast on a 6 month horizon - we obviously wouldn’t expect it to be accurate that far into the future.
future <- make_future_dataframe(m, periods = 6, freq = 'month')
forecast <- predict(m, future)
plot(m, forecast)
As we can see, the model seems consistent on a short horizon, and gets very wide as it goes further into the future. More importantly however, it has extracted a yearly seasonal compontent - the summer decrease we identified previously - as well as a long term trend.
prophet_plot_components(m, forecast)
These predictions seem far-fetched, but remember we will be observing a London wide error rate. As such, we must now isolate our “pandemic period” - which we define as April and May 2020 - and compare the predicted crime counts to the actual crime counts to obtain a metric of our “COVID crime shift”, or our error rate.
forecast$Month <- month(forecast$ds)
forecast$Year <- year(forecast$ds)
this_year <- filter(forecast, Year > 2019)
peak_pandemic <- filter(this_year, Month== 4 | Month== 5 )
predictionPivot <- peak_pandemic %>%
group_by(Month) %>%
single_msoa_df$MonthNum <- month(single_msoa_df$DateClean)
single_msoa_df$YearNum <- year(single_msoa_df$DateClean)
this_year_actual <- filter(single_msoa_df, YearNum > 2019)
peak_pandemic_actual <- filter(this_year_actual, MonthNum== 4 | MonthNum== 5 )
actual_burglary <- sum(peak_pandemic_actual$count_by_msoa)
pred_burglary <- sum(predictionPivot$predicted_burglary)
error <- actual_burglary - pred_burglary
percentage_error <- error / pred_burglary
print("Burglary Count")
[1] "Burglary Count"
print(actual_burglary)
print("Predicted")
[1] "Predicted"
print(pred_burglary)
[1] 7.625019
print("Actual Error")
print(error)
[1] -6.625019
print("Percentage Error")
[1]
print(percentage_error)
[1] -0.8688528
In this MSOA, our model predicted nearly 8 burglaries would occur in these two months, based on pre-pandemic trends. In reality, 1 took place - a large error rate, suggesting a strong “COVID effect”.
This process can now be replicated for every MSOA in London, to obtain this metric for each MSOA.
msoa_error_tibble
Our process has completed: we have a “COVID shift” measure for all of London.
We now need to use our forecasts to measure the “error” - this should provide an indication of the “COVID Crime Shift”, or how much the actual crime diverted from the previous forecasts.
I explored various avenues for this: the ideal solution would be a relative rate of the error, as MSOAs with large crime numbers will likely generate large errors, and so a rate would be ideal, though this is complicated by our erratic prediction and mix of positive and negative numbers.
Our final solution has explored two options: - the absolute error number - the relative error once the crime and predictions have been transformed (by adding 50)
We visualise and describe these statistics first to ensure they appear sensible.
#write_csv(msoa_error_tibble, "msoa_error_table2.csv")
msoa_error_tibble <- read_csv("msoa_error_table2.csv")
#msoa_error_tibble<- msoa_error_table[2:980,]
msoa_error_tibble <- msoa_error_tibble[2:980, ]
msoa_error_tibble <- left_join(msoa_error_tibble, robbery_by_msoa, by = "MSOA11CD")
msoa_error_tibble <- left_join(msoa_error_tibble, burglary_by_msoa, by = "MSOA11CD")
msoa_error_tibble$RPDRobbery <- (msoa_error_tibble$robberyActual - msoa_error_tibble$robberyPredicted)/((msoa_error_tibble$robberyPredicted + msoa_error_tibble$robberyActual)/2)
msoa_error_tibble$robberyActualShifted <- msoa_error_tibble$robberyActual + 50
msoa_error_tibble$robberyPredictedShifted <- msoa_error_tibble$robberyPredicted + 50
msoa_error_tibble$burglaryActualShifted <- msoa_error_tibble$burglaryActual + 50
msoa_error_tibble$burglaryPredictedShifted <- msoa_error_tibble$burglaryPredicted + 50
print("Burglary Error")
[1] "Burglary Error"
Min. 1st Qu. Median Mean 3rd Qu. Max.
-97.191 -12.870 -5.271 -6.090 2.111 48.727
print("Burglary Relative Error")
[1] "Burglary Relative Error"
summary(msoa_error_tibble$RPDburglaryShifted)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.81266 -0.20689 -0.08826 -0.07976 0.03612 1.29467
[1] "Robbery Error"
summary(msoa_error_tibble$robberyError)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.150 -3.840 2.205 25.564
print("Robbery Relative Error")
[1] "Robbery Relative Error"
summary(msoa_error_tibble$RPDRobberyShifted)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.45491 -0.11808 -0.04081 -0.04876 0.04309 0.62020
As we can see, the average London MSOA experienced a negative COVID crime shift for both burglary and robbery, but this is far from equally distributed - at the extremes, some areas actually see large increases on our predicted values.
library(ggpubr)
Loading required package: ggplot2
Registered S3 method overwritten by 'data.table':
method from
print.data.table
burg_hist <- ggplot(msoa_error_tibble, aes(x=burglaryError)) + geom_histogram()
rob_hist <-ggplot(msoa_error_tibble, aes(x=robberyError)) + geom_histogram()
burg_r_hist <- ggplot(msoa_error_tibble, aes(x=RPDburglaryShifted)) + geom_histogram()
rob_r_hist <- ggplot(msoa_error_tibble, aes(x=RPDRobberyShifted)) + geom_histogram()
scatter <- ggplot(msoa_error_tibble, aes(x = RPDRobberyShifted, y = RPDburglaryShifted)) +
geom_point()
r_scatter <- ggplot(msoa_error_tibble, aes(x = robberyError, y = burglaryError)) +
geom_point()
ggarrange(rob_hist, burg_hist, rob_r_hist, burg_r_hist,scatter, r_scatter, ncol=2, nrow=3 )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Our shifted relative error rate seems to function as intended: while there are still outliers, they are more concentrated than they are for the pure error term, and the overall distribution is more focused, while still indicating the direction and relative strength of our COVID effect.
Let’s map this effect visually, and see if any particular areas stand out.
#re-ingest our geographic MSOA borders
msoa_borders <- st_read("msoa_borders/MSOA_2011_London_gen_MHW.tab", crs=27700)
Reading layer `MSOA_2011_London_gen_MHW' from data source `' using driver `MapInfo File'
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
geographic_error_map <- left_join(msoa_borders, msoa_error_tibble, by = "MSOA11CD")
burg_map <- tm_shape(geographic_error_map) +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
rob_map <-tm_shape(geographic_error_map) +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
burg_map_rate <- tm_shape(geographic_error_map) +
tm_fill(col = "RPDRobberyShifted", title = "Robbery Error Relative", palette="-PuOr")+
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
rob_map_rate <-tm_shape(geographic_error_map) +
tmap_arrange(burg_map, rob_map, burg_map_rate, rob_map_rate , nrow = 2, ncol=2)
It’s hard to identify any obvious effect visually, but we do notice that while central London sees some very strong reductions, it also sees some increases. Conversely, the outskirts of London (notably to the south and West) are a near continuous area of large decreases. The effect does vary by offence type, but the pattern seen in South and West London appears broadly consistent.
We’ve identified that the COVID crime effect was felt unequally accross London, and varies by offence type. To finalise our project, we will be linking our data to demographic data provided by MOPAC, and aiming to use it to identify correlates to our “covid shift”, and hopefully build models disentangling the effect.
library(readxl)
#ingest ATLAS
msoa_atlas <- read_excel("msoa_atlas/msoa-data.xls")
New names:
* `House Prices Sales 2011` -> `House Prices Sales 2011...129`
* `House Prices Sales 2011` -> `House Prices Sales 2011...130`
geographic_msoa_matrix <- left_join(geographic_error_map, msoa_atlas, by = "MSOA11CD")
#convert to tibble
msoa_matrix_tbl <- as_tibble(geographic_msoa_matrix)
#select only numeric data
msoa_matrix_numeric <-dplyr::select_if(msoa_matrix_tbl, is.numeric)
msoa_matrix_numeric
Having now ingested and linked our data, we begin by exploring the factors most highly correlated with our relative error rates.
corr_df <- correlate(dplyr::select_if(msoa_matrix_tbl, is.numeric), quiet = TRUE)
options(scipen=999)
Starting by our relative robbery error, a few interesting correlates stand out: - road traffic casualties - burglary numbers - the number of dwellings with no usual residents, and the number of commercial residents - households with no cars - the age composition of the area - general deprivation indicators (such as the proportion of households with central heating)
#show only correlates with an absolute value higher than 0.2
#filter(dplyr::select(corr_df[order(corr_df$RPDRobberyShifted),] , term, RPDRobberyShifted), RPDRobberyShifted < -0.2 | RPDRobberyShifted > 0.2)
#filter(dplyr::select(corr_df[order(corr_df$RPDRobberyShifted),] , term, RPDRobberyShifted), RPDRobberyShifted < -0.2 | RPDRobberyShifted > 0.2)
high_corr_rob <- filter(dplyr::select(corr_df, term, RPDRobberyShifted), RPDRobberyShifted < -0.15 | RPDRobberyShifted > 0.15)
high_corr_rob[order(high_corr_rob$RPDRobberyShifted),]
NA
The correlations for burglary are weaker - only a few have an absolute value higher than 0.2 - but a few stand out: - households with no residents - high robbery numbers - house prices
high_corr_burg <- filter(dplyr::select(corr_df, term, RPDburglaryShifted), RPDburglaryShifted < -0.15 | RPDburglaryShifted > 0.15)
high_corr_burg[order(high_corr_burg$RPDburglaryShifted),]
NA
These correlates suggest we can model this shift - this is likely to prove more reliable for robbery (where the correlations are stronger), and seem linked to usual resident population(as measured by household composition), deprivation (through various proxy indicators such as central heating presence or housing type), and general crime patterns (through total burglary and robbery numbers)
We will take two approaches for modelling: a simple regression (to identify strong links) and random forest regressors (to identify non-linear associations)
We begin through the use of simple OLS regression. This is a linear model that has large limitations for modelling complex relationships, but can be an effective first step, effectively with a few transformations.
R does not cope well with blank spaces in terms, so we’ll extract and rename our key correlates.
msoa_copy <- msoa_matrix_numeric
names(msoa_copy)[names(msoa_copy) == "Dwelling type (2011) Household spaces with no usual residents"] <- "DwellingNoResidents"
names(msoa_copy)[names(msoa_copy) == "House Prices Median House Price (£) 2010"] <- "MedianHousePrice"
names(msoa_copy)[names(msoa_copy) == "Dwelling type (2011) Flat, maisonette or apartment"] <- "FlatAprt"
names(msoa_copy)[names(msoa_copy) == "Qualifications (2011 Census) Schoolchildren and full-time students: Age 18 and over"] <- "fullTimeStudents"
names(msoa_copy)[names(msoa_copy) == "Car or van availability (2011 Census) No cars or vans in household"] <- "NoCars"
names(msoa_copy)[names(msoa_copy) == "Ethnic Group (2011 Census) Other ethnic group"] <- "OtherEthnicGroup"
names(msoa_copy)[names(msoa_copy) == "Central Heating (2011 Census) Households with central heating (%)"] <- "CentralHealingPercent"
Let’s now extract these to a separate dataframe, remove any missing values, and provide some quick summary statistics to identify any obvious concerns.
feature_df <- dplyr::select(msoa_copy, RPDburglaryShifted, RPDRobberyShifted, total_burglaries, total_robberies, DwellingNoResidents, MedianHousePrice, FlatAprt, fullTimeStudents, NoCars, OtherEthnicGroup, CentralHealingPercent, AvHholdSz, ComEstRes)
feature_df <- drop_na(feature_df, RPDburglaryShifted, RPDRobberyShifted)
feature_df
summary(feature_df)
RPDburglaryShifted RPDRobberyShifted total_burglaries total_robberies DwellingNoResidents FlatAprt fullTimeStudents NoCars
Min. :-0.81266 Min. :-1.45491 Min. : 46.0 Min. : 2.00 Min. : 14.0 Min. : 137625 Min. : 54.0 Min. : 122.0 Min. : 185
1st Qu.:-0.20689 1st Qu.:-0.11808 1st Qu.: 121.0 1st Qu.: 29.00 1st Qu.: 59.0 1st Qu.: 222500 1st Qu.: 830.5 1st Qu.: 305.5 1st Qu.: 796
Median :-0.08826 Median :-0.04081 Median : 159.0 Median : 53.00 Median : 86.0 Median : 465.0 Median :1256
Mean :-0.07976 Mean :-0.04876 Mean : 175.5 Mean : 78.53 Mean : 123.4 Mean : 310710 Mean :1798.2 Mean : 539.5 Mean :1382
3rd Qu.: 0.03612 3rd Qu.: 0.04309 3rd Qu.: 205.0 3rd Qu.: 92.00 3rd Qu.: 133.0 3rd Qu.: 351525 3rd Qu.:2566.5 3rd Qu.: 654.5 3rd Qu.:1858
Max. : 1.29467 Max. : 0.62020 Max. :1973.0 Max. :2456.00 Max. :1556.0 Max. :1425000 Max. :6429.0 Max. :3370.0 Max. :4319
OtherEthnicGroup CentralHealingPercent AvHholdSz ComEstRes
Min. : 16.0 Min. :92.10 Min. :1.600 Min. : 0
1st Qu.: 134.0 1st Qu.:96.60 1st Qu.:2.300 1st Qu.: 9
Median : 232.0 Median :97.40 Median :2.500 Median : 41
Mean : 285.8 Mean :97.24 Mean :2.506 Mean : 102
3rd Qu.: 376.5 3rd Qu.:98.00 3rd Qu.:2.700 3rd Qu.: 105
Max. :3001.0 Max. :99.60 Max. :3.900 Max. :2172
colSums(is.na(feature_df))
RPDburglaryShifted RPDRobberyShifted total_burglaries total_robberies DwellingNoResidents MedianHousePrice FlatAprt
0 0 0 0 0 0 0
fullTimeStudents NoCars OtherEthnicGroup CentralHealingPercent AvHholdSz ComEstRes
0 0 0 0 0 0
As a quick initial exercise, we’ll create a scatter plot of the interaction between each and every one of our variables. This will probably be too messy to make any real findings, but can serve to quickly highlight strong associations.
pairs(feature_df)
As we hoped, some obvious relationships stand out: for instance, the presence of apartments, and households with no cars, or central heating and average household size.
Our robbery and burglary data and change rates are densely clustered - they’re unlikely to cleanly associate with anything. With that in mind, we’ll perform a log transformation. This cannot be undertaken with negative values, so once again we’ll perform a shift (of 2) for both of our relative error numbers, as well as a commercial resident column, before log transforming our features.
feature_df$BurglaryRPDTranform <- feature_df$RPDburglaryShifted + 2
feature_df$RobberyRPDTranform <- feature_df$RPDRobberyShifted + 2
feature_df$ComEstResTranform <- feature_df$ComEstRes + 2
for (col in colnames(feature_df)){
new_name <- paste("log_", col, sep = "")
feature_df[new_name] <- log(feature_df[col])
}
NaNs producedNaNs produced
drop<- c( "log_ComEstRes", "log_RPDburglaryShifted", "log_RPDRobberyShifted")
feature_df<- feature_df[,!(names(feature_df) %in% drop)]
feat_transform_df <- feature_df[,17:29]
orig_feat_df <- feature_df[,0:17]
#feat_transform_df <- feature_df[,17:28]
colSums(is.na(feature_df))
RPDburglaryShifted RPDRobberyShifted total_burglaries total_robberies DwellingNoResidents MedianHousePrice
0 0 0 0 0 0
FlatAprt fullTimeStudents NoCars OtherEthnicGroup CentralHealingPercent AvHholdSz
0 0 0 0 0 0
ComEstRes BurglaryRPDTranform RobberyRPDTranform ComEstResTranform log_total_burglaries log_total_robberies
0 0 0 0 0 0
log_DwellingNoResidents log_MedianHousePrice log_FlatAprt log_fullTimeStudents log_NoCars log_OtherEthnicGroup
0 0 0 0 0 0
log_CentralHealingPercent log_AvHholdSz log_BurglaryRPDTranform log_RobberyRPDTranform log_ComEstResTranform
0 0 0 0 0
We’ve now separated a separate dataframe where each value has been log transformed - while this isn’t hugely rigorous (and would benefit from inspecting the relationships in more detail) it serves our immediate purpose.
pairs(feat_transform_df)
While we’ve introduced a bit of noise, we’ve also “forced” some of our variables into relationships that look semi linear.
To dig into this deeper, let’s create a correlation matrix for our entire transformed dataframe.
correlate(feat_transform_df)
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
To provide a visual aid, I’ve extracted the column for our robbery relative change rate, and sorted the table accordingly.
dplyr::select(correlate(feat_transform_df)[order(correlate(feat_transform_df)$log_BurglaryRPDTranform),], term, log_BurglaryRPDTranform)
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
Now that we’ve transformed our data, cleaned it up, and identified potential correlates, let’s build our linear model.
There are various automated tools for this process that seek to provide the highest fit and significance, but given the high degree of correlation between my chosen features, I’ve taken a more manual approach and tested a variety of models until I identified one with a suitable fit. The final model is below.
mod_burglary <- lm(log_BurglaryRPDTranform ~ log_total_burglaries + log_FlatAprt + log_MedianHousePrice + log_ComEstResTranform, data = feat_transform_df)
summary(mod_burglary)
Call:
lm(formula = log_BurglaryRPDTranform ~ log_total_burglaries +
log_FlatAprt + log_MedianHousePrice + log_ComEstResTranform,
data = feat_transform_df)
Residuals:
Min 1Q Median -0.06547 -0.00309 0.05830 0.56322
Coefficients:
(Intercept) 1.255068 0.112681 11.138 -6.062 1.92e-09 ***
log_FlatAprt 0.016153 0.005157 3.132 0.001786 ***
log_ComEstResTranform -0.008125 0.002119 -3.833 0.000135 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1027 on 974 degrees of freedom
Multiple R-squared:
F-statistic: 21.72 974 DF, p-value: < 2.2e-16
Our model suggests the largest burglary decrease linked to lockdown was in MSOAs which a high level of historic burglary. The composition of housing/accomodation and area type also seems to play a role, with those areas with higher median house prices, and a larger number of commercial residents, seeing stronger decreases, while conversely areas with large number of apartments temper the effect.
While all our variables are significant, the model is not a particularly good fit - the adjusted R2 is around 0.08, suggesting that less than 10% of the variance is accounted for by our model. I suspect more geographic features - such as distance from central London, more accurate footfall, or spatial lags - would probably be useful, but that’s outside the scope of this project.
Let’s perform a similar exercise for robbery.
dplyr::select(correlate(feat_transform_df)[order(correlate(feat_transform_df)$log_RobberyRPDTranform),], term, log_RobberyRPDTranform)
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
mod_burglary <- lm(log_RobberyRPDTranform ~ log_total_robberies + log_total_burglaries + log_FlatAprt + log_CentralHealingPercent, data = feat_transform_df)
summary(mod_burglary)
Call:
lm(formula = log_RobberyRPDTranform ~ log_total_robberies + log_total_burglaries +
log_FlatAprt + log_CentralHealingPercent, data = feat_transform_df)
Residuals:
Min 1Q Median 3Q Max
-0.95347 -0.04269 -0.00516 0.04607 0.31963
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.057332 1.394750 -3.626 0.000303 ***
log_total_robberies -0.031452 0.004902 -6.416 0.0000000002177 ***
log_total_burglaries -0.066422 0.009692 -6.853 0.0000000000128 ***
log_FlatAprt 0.029271 0.005070 5.773 0.0000000104411 ***
log_CentralHealingPercent 1.304434 0.301621 4.325 0.0000168425948 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08917 on 974 degrees of freedom
Multiple R-squared: 0.21020.207
F-statistic: 64.81 on 4 and 974 DF, p-value: < 0.00000000000000022
This is a notably better fit than our burglary model, with our R2 suggesting we now account for over 20% of the variance. The nature of our predictors is also quite different: while we still see a negative relationship with historic crime (with areas of high historic crime experiencing larger relative decreases), there is a positive relationship with both the presence of apartments and central heating.
I suspect some of these features are correlates of deprivation, so I want to create a quick scatter of three - for now we’ll do it against median house price, which is definitely deprivation correlated.
heating <- ggplot(feature_df, aes(x = log_MedianHousePrice, y = log_CentralHealingPercent)) +
geom_point()+
geom_smooth(method=lm)
apartments <- ggplot(feature_df, aes(x = log_MedianHousePrice, y = log_FlatAprt)) +
geom_smooth(method=lm)
comest <- ggplot(feature_df, aes(x = log_MedianHousePrice, y = log_ComEstResTranform)) +
geom_point()+
geom_smooth(method=lm)
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
While there does appear to be a relationship with some of these, it isn’t strong - this suggests the factor’s we have identified are significant not because of their association with deprivation and poverty, but because of what they mean about the specific characteristics of the area.
Unlike regression, random forest doesn’t really on any specific type of association - instead, we rely on computing power, repetition and iteration to capture the very best predictor for our variable, in any combination.
There are risks to this method: our sample size is smaller than I’d like, and this may lead to over-fit of outlier MSOAs.
It does mean we don’t need to worry about transformations or correlations: we can return to our original dataset, and let the model identify the strongest predictors.
# we remove rows where our main error is na or missing
rf_msoa_matrix <- drop_na(msoa_matrix_numeric, RPDRobberyShifted)
clean_rf_matrix <- rf_msoa_matrix[ , colSums(is.na(rf_msoa_matrix)) == 0]
#then we drop out any values that directly predict our error.
drop<- c("burglaryActual","burglaryError","burglaryPercentError","burglaryPredicted","robberyActual","robberyPredicted","robberyError","robberyPercentError", "RPDBurglaryShifted", "RPDRobberyShifted")
#remove out selected columns
data<- clean_rf_matrix[,!(names(clean_rf_matrix) %in% drop)]
names(clean_rf_matrix)<-make.names(names(clean_rf_matrix),unique = TRUE)
drop<- c("burglaryActual","burglaryError","robberyActual","robberyError","RPDBurglary","RPDRobbery","robberyActualShifted","robberyPredictedShifted","burglaryActualShifted","burglaryPredictedShifted", "burglaryPredicted", "burglaryPercentError", "robberyPredicted", "robberyPercentError")
#drop<- c("burglaryPercentError","burglaryPredicted","robberyPredicted","robberyPercentError")
We start with modelling our relative rate of burglary shift. We divide our sample into a training set which we’ll use to train the model, and a test set we’ll use to verify accuracy and validity.
sample = sample.split(data$RPDburglaryShifted, SplitRatio = 0.7)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
RPDburglaryShifted ~ .,
data=train,
summary(rf_burglary)
Length Class Mode
call 1 -none- -none- numeric
mse
rsq 500 -none- numeric
oob.times 685 -none- numeric -none- numeric
localImportance 0 -none- NULL
ntree 1 numeric
forest 11 -none- list
coefs 0 -none- NULL
y 685 0 -none- NULL
terms 3
prediction <-predict(rf_burglary, test)
Metrics::rmse(test$RPDburglaryShifted, prediction)
[1] 0.2086745
While machine learning models were once considered opaque and difficult to interpret, several libraries now offer functionality to explain predictions. Here I use DALEX to do just this.
rf_explainer_burglary <- DALEX::explain(rf_burglary, data=train, y= train$RPDburglaryShifted)
Preparation of a new explainer is initiated
-> model label : ( default )
-> data : 685 rows 211 cols
-> data : tibble converted into a data.frame
-> target variable : -> predict function : yhat.randomForest will be used ( default )
-> predicted values : No value for predict function target column. ( default )
randomForest , ver. 4.6.14 , taskdefault )
-> predicted values : numerical, min = -0.5780085 , mean = -0.08106178 , max = 0.3933345
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -0.2656502 , mean = -0.000773038 , max = 0.3310273
A new explainer has been created!
rf_perf_burg
rmse : 0.07978734
r2 : 0.8458048
mad : 0.04547341
Residuals:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
-0.265650178 -0.088750799 -0.057666182 -0.039718463 -0.023652909 -0.007873967 0.008104222 0.029419548 0.052525556 0.089675687 0.331027295
We can see that our model significantly outperforms our best linear models: the R2 suggests that almost 85% of the variance is correctly interpreted, and our rmse (root mean squared error) is around 0.2 on our test set.
This model would be ill-suited to prediction or operationalising -it is a default forecast with no hyper-parameter tuning, and no consideration of error rates - but using tools like DALEX, we can identify which predictors the model identifies as most important.
model_parts_burg <-model_parts(rf_explainer_burglary)
model_parts_burg
variable mean_dropout_loss label
1 RPDburglaryShifted 0.07978734 randomForest
3 Road.Casualties.2010.Fatal 0.07989223 randomForest
4 Road.Casualties.2011.Fatal 0.07990359 randomForest
5 Road.Casualties.2012.Fatal 0.07994289 randomForest
6 Age.Structure..2011.Census..All.Ages 0.08016856 randomForest
7 Mid.year.Estimate.totals.All.Ages.2010 0.08020098 randomForest
8 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household.... 0.08023165 randomForest
9 UsualRes 0.08027333 randomForest
10 randomForest
12 Households..2011..All.Households 0.08034082 randomForest
13 Mid.year.Estimate.totals.All.Ages.2011 0.08034805 randomForest
14 Ethnic.Group..2011.Census..White.... 0.08035199 randomForest
15 Mid.year.Estimate.totals.All.Ages.2012 0.08035832 randomForest
16 HholdRes 0.08036393 randomForest
17 Qualifications..2011.Census..Highest.level.of.qualification..Other.qualifications 0.08036488 randomForest
18 randomForest
20 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household
22 Country.of.Birth..2011..Not.United.Kingdom.... 0.08044878 randomForest Mid.year.Estimate.totals.All.Ages.2006 0.08047061 randomForest
25 Tenure..2011..Owned..Owned.outright.... 0.08047637 randomForest
26 Ethnic.Group..2011.Census..BAME.... 0.08047866 randomForest
27 Ethnic.Group..2011.Census..BAME 0.08049934 randomForest
28 Religion..2011..Other.religion.... 0.08050526 randomForest
29 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident 0.08053065 randomForest
30 Hholds 0.08054706 randomForest
31 Household.Language..2011..At.least.one.person.aged.16.and.over.in.household.has.English.as.a.main.language 0.08057152 randomForest
32 Health..2011.Census..Day.to.day.activities.limited.a.little 0.08059003 randomForest
33 Religion..2011..Sikh.... 0.08059177 randomForest
34 Population.Density.Persons.per.hectare..2012. 0.08059472 randomForest
35 Household.Language..2011....of.households.where.no.people.in.household.have.English.as.a.main.language 0.08060243 randomForest
36 Road.Casualties.2011.Serious 0.08060476 randomForest
37 Household.Language..2011..No.people.in.household.have.English.as.a.main.language 0.08060958 randomForest
38 Car.or.van.availability..2011.Census..Sum.of.all.cars.or.vans.in.the.area 0.08064046 randomForest
39 Country.of.Birth..2011..Not.United.Kingdom 0.08066560 randomForest
42 Age.Structure..2011.Census..30.44 0.08066940 randomForest
43 Qualifications..2011.Census..Highest.level.of.qualification..Level.1.qualifications 0.08067143 randomForest
44 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household.... 0.08067284 randomForest
45 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan 0.08067322 randomForest
46 Country.of.Birth..2011..United.Kingdom.... 0.08068890 randomForest
47 Economic.Activity..2011.Census..Economically.active..Total 0.08069008 randomForest
48 Mid.year.Estimate.totals.All.Ages.2004 0.08069682 randomForest
49 Country.of.Birth..2011..United.Kingdom 0.08071084 randomForest
50 Mid.year.Estimate.totals.All.Ages.2003 0.08071974 randomForest
51 Lone.Parents..2011.Census..All.lone.parent.housholds.with.dependent.children 0.08073043 randomForest
52 Health..2011.Census..Day.to.day.activities.limited.a.little.... 0.08075471 randomForest
55 Tenure..2011..Social.rented.... 0.08076387 randomForest
56 PopDen 0.08076430 randomForest
57 Mid.year.Estimates.2012..by.age.60.64 0.08077331 randomForest
58 Mid.year.Estimate.totals.All.Ages.2008 0.08077994 randomForest
59 Religion..2011..No.religion.... 0.08078100 randomForest
60 House.Prices.Sales.2010 0.08078206 randomForest
61 Mid.year.Estimate.totals.All.Ages.2009 0.08078457 randomForest
62 Economic.Activity..2011.Census..Economically.inactive.. 0.08079916 randomForest
63 Ethnic.Group..2011.Census..Asian.Asian.British 0.08080477 randomForest
64 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household 0.08080990 randomForest
65 Mid.year.Estimates.2012..by.age.85.89 0.08081135 randomForest
66 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan.... 0.08082316 randomForest
67 Age.Structure..2011.Census..16.29 0.08083177 randomForest
68 Qualifications..2011.Census..Highest.level.of.qualification..Level.4.qualifications.and.above 0.08083667 randomForest
69 Age.Structure..2011.Census..45.64 0.08083957 randomForest
70 Mid.year.Estimates.2012..by.age.55.59 0.08084109 randomForest
71 Economic.Activity..2011.Census..Economically.active.. 0.08084118 randomForest
72 Religion..2011..Buddhist.... 0.08084915 randomForest
73 House.Prices.Median.House.Price.....2011 0.08085332 randomForest
74 Religion..2011..Hindu.... 0.08085661 randomForest
75 Central.Heating..2011.Census..Households.with.central.heating.... 0.08085781 randomForest
76 Qualifications..2011.Census..Highest.level.of.qualification..Level.2.qualifications 0.08086017 randomForest
77 Ethnic.Group..2011.Census..Other.ethnic.group 0.08086059 randomForest
78 Economic.Activity..2011.Census..Economically.active..Unemployed 0.08086791 randomForest
79 Religion..2011..Jewish.... 0.08087423 randomForest
80 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached.... 0.08087529 randomForest
81 Household.Composition..2011..Numbers.One.person.household 0.08087823 randomForest
82 Mid.year.Estimates.2012..by.age.80.84 0.08088729 randomForest
83 Household.Composition..2011..Numbers.Lone.parent.household 0.08088891 randomForest 0.08092207
87 Household.Composition..2011..Numbers.Couple.household.without.dependent.children 0.08093548 randomForest
88 Adults.in.Employment..2011.Census..No.adults.in.employment.in.household..With.dependent.children 0.08093596 randomForest
89 House.Prices.Median.House.Price.....2005 0.08094486 randomForest
90 Mid.year.Estimate.totals.All.Ages.2002 0.08094711 randomForest
91 Income.Deprivation..2010....living.in.income.deprived.households.reliant.on.means.tested.benefit 0.08094734 randomForest
92 Economic.Activity..2011.Census..Economically.inactive..Total 0.08095970 randomForest
93 Dwelling.type..2011..Flat..maisonette.or.apartment 0.08097084 randomForest
94 Adults.in.Employment..2011.Census....of.households.with.no.adults.in.employment..With.dependent.children 0.08097164 randomForest
95 Mid.year.Estimates.2012..by.age.15.19 0.08097487 randomForest
96 Qualifications..2011.Census..Highest.level.of.qualification..Apprenticeship 0.08099627 randomForest
97 Road.Casualties.2010.Serious 0.08099896 randomForest
98
100 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household 0.08100307 randomForest
101 Qualifications..2011.Census..Schoolchildren.and.full.time.students..Age.18.and.over 0.08102459 randomForest
102 House.Prices.Median.House.Price.....2006 0.08102808 randomForest
103 Car.or.van.availability..2011.Census..1.car.or.van.in.household.... 0.08102910 randomForest
104 House.Prices.Sales.2011...130 0.08103290 randomForest
105 House.Prices.Median.House.Price.....2010 0.08103351 randomForest
106 Religion..2011..No.religion 0.08104129 randomForest
107 Religion..2011..Muslim.... 0.08104171 randomForest
108 randomForest
110 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached 0.08107212 randomForest
111 Tenure..2011..Private.rented 0.08107812 randomForest
112 Household.Composition..2011..Numbers.Couple.household.with.dependent.children randomForest
114 Dwelling.type..2011..Flat..maisonette.or.apartment.... 0.08110023 randomForest
117 Mid.year.Estimates.2012..by.age.90. 0.08110115 randomForest
118 House.Prices.Median.House.Price.....2009 0.08112692 randomForest
119 House.Prices.Sales.2013.p. 0.08112842 randomForest
120 Tenure..2011..Social.rented 0.08113023 randomForest
121 Mid.year.Estimates.2012..by.age.5.9 0.08113618 randomForest
122 House.Prices.Median.House.Price.....2008 0.08114446 randomForest
123 Household.Composition..2011..Numbers.Other.household.Types 0.08116352 randomForest
124 Obesity.Percentage.of.the.population.aged.16..with.a.BMI.of.30...modelled.estimate..2006.2008 0.08116727 randomForest
125 Car.or.van.availability..2011.Census..1.car.or.van.in.household 0.08116794 randomForest
126 Health..2011.Census..Very.good.health.... 0.08118887 randomForest
127 Mid.year.Estimates.2012..by.age.25.29 0.08120511 randomForest
128 Religion..2011..Hindu 0.08122569 randomForest
129 Health..2011.Census..Bad.health 0.08123266 randomForest
130 Mid.year.Estimates.2012..by.age...65. 0.08123475 randomForest
131 Land.Area.Hectares 0.08124618 randomForest
132 Ethnic.Group..2011.Census..Asian.Asian.British.... 0.08125689 randomForest
133 House.Prices.Sales.2009 0.08127707 randomForest
134 Qualifications..2011.Census..No.qualifications 0.08128507 randomForest
135 Mid.year.Estimates.2012..by.age.30.34 0.08131069 randomForest
136 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British.... 0.08131640 randomForest
137 Age.Structure..2011.Census..65. 0.08131793 randomForest
138 Mid.year.Estimates.2012..by.age.75.79 0.08132623 randomForest
139 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British 0.08132628 randomForest
140 Ethnic.Group..2011.Census..White 0.08133305 randomForest
141 Mid.year.Estimates.2012..by.age.50.54 0.08134709 randomForest 0.08136569 0.08136868 randomForest
146 randomForest
148 House.Prices.Median.House.Price.....2012 0.08141885 randomForest
149 Health..2011.Census..Fair.health.... 0.08142456 randomForest
150 AvHholdSz 0.08142887 randomForest
151 Life.Expectancy.Males 0.08143464 randomForest
152 Religion..2011..Religion.not.stated.... 0.08144721 randomForest
153 Religion..2011..Buddhist 0.08146208 randomForest
154 Low.Birth.Weight.Births..2007.2011..LCL...Lower.confidence.limit 0.08146848 randomForest
155 Religion..2011..Sikh 0.08149141 randomForest
156 Health..2011.Census..Very.good.health 0.08149203 randomForest
157 Low.Birth.Weight.Births..2007.2011..Low.Birth.Weight.Births.... 0.08149302 randomForest
158 Mid.year.Estimates.2012..by.age.35.39 0.08149439 randomForest
159 Health..2011.Census..Day.to.day.activities.limited.a.lot 0.08149561 randomForest
160 Tenure..2011..Owned..Owned.outright 0.08149998 randomForest
161 Road.Casualties.2010.Slight 0.08150819 randomForest
162 Low.Birth.Weight.Births..2007.2011..UCL...Upper.confidence.limit 0.08151073 randomForest
163 Road.Casualties.2012.Serious 0.08153523 randomForest
164 Tenure..2011..Private.rented.... 0.08154695 randomForest
165 House.Prices.Sales.2007 0.08156845 randomForest
166 Health..2011.Census..Day.to.day.activities.not.limited.... 0.08164471 randomForest
167 Road.Casualties.2012.2012.Total 0.08166058 randomForest
168 Road.Casualties.2010.2010.Total 0.08167894 randomForest
169 Religion..2011..Other.religion 0.08168736 randomForest
170 Road.Casualties.2012.Slight 0.08169606 randomForest
171 Mid.year.Estimates.2012..by.age.40.44 0.08173699 randomForest
172 Mid.year.Estimate.totals.All.Ages.2007 0.08173808 randomForest
173 Health..2011.Census..Day.to.day.activities.limited.a.lot.... 0.08174045 randomForest
174 Ethnic.Group..2011.Census..Mixed.multiple.ethnic.groups randomForest
177 Qualifications..2011.Census..Highest.level.of.qualification..Level.3.qualifications 0.08175634 randomForest
178 Dwelling.type..2011..Whole.house.or.bungalow..Detached 0.08177575 randomForest
179 Religion..2011..Christian 0.08179686 randomForest
180 House.Prices.Sales.2005 0.08182753 randomForest
181 House.Prices.Median.House.Price.....2007 0.08184118 randomForest
182 Health..2011.Census..Bad.health.... 0.08185248 randomForest
183 Health..2011.Census..Good.health.... 0.08194459 randomForest
184 total_robberies 0.08196227 randomForest
185 House.Prices.Sales.2006 0.08196583 randomForest
186 Dwelling.type..2011..Household.spaces.with.no.usual.residents.... 0.08200226 randomForest
187 Road.Casualties.2011.2011.Total 0.08203031 randomForest
188 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace. 0.08204300 randomForest
189 Mid.year.Estimates.2012..by.age.20.24 0.08210720 randomForest
190 Mid.year.Estimates.2012..by.age.65.69 0.08211403 randomForest
191 Dwelling.type..2011..Whole.house.or.bungalow..Detached.... 0.08213368 randomForest
192 Household.Income.Estimates..2011.12..Total.Mean.Annual.Household.Income.... 0.08215211 randomForest
193 Mid.year.Estimates.2012..by.age.70.74 0.08221997 randomForest
194 Incidence.of.Cancer.Breast.Cancer 0.08229737 randomForest
195 Road.Casualties.2011.Slight 0.08231561 randomForest
196 Economic.Activity..2011.Census..Unemployment.Rate 0.08239690 randomForest
197 Lone.Parents..2011.Census..Lone.parent.not.in.employment.. 0.08246826 randomForest
198 0.08262825 randomForest
200 Dwelling.type..2011..Household.spaces.with.no.usual.residents 0.08264973 randomForest
201 Life.Expectancy.Females 0.08276533 randomForest
202 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace..... 0.08284255 randomForest
203 Religion..2011..Jewish 0.08293921 randomForest
204 Incidence.of.Cancer.All 0.08306588 randomForest
205 RPDRobberyShifted 0.08370759 randomForest
206 Household.Composition..2011..Percentages.Couple.household.with.dependent.children 0.08393367 randomForest
207 House.Prices.Sales.2008 0.08433984 randomForest
208 Household.Composition..2011..Percentages.Couple.household.without.dependent.children 0.08459338 randomForest
209 Mid.year.Estimates.2012..by.age...0.to.14 0.08578863 randomForest
210 total_burglaries 0.08833999 randomForest
211 ComEstRes 0.09121371 randomForest
212 Mid.year.Estimates.2012..by.age.0.4 0.09159861 randomForest
213 _baseline_ 0.24106307 randomForest
plot(model_parts_burg, max_vars=25)
The three most important features our model highlights are: - the age composition of the population - the number of residents which are in commercial property - the historic number of burglaries
This seems to corroborate our previous regression model. To further unpick these trends, we can use Partial Dependence Plots to identify how the model prediction shifts with these values.
pdp <- model_profile(rf_explainer_burglary)
plot(pdp, variables="total_burglaries")
We still see a strong association between a high number of historic, and a large reduction during the lockdown period
plot(pdp, variables= "Mid.year.Estimates.2012..by.age.0.4")
MSOAs with very large very young populations seem to have a less strong burglary lockdown decrease (though this is likely heavily associated to deprivation)
plot(pdp, variables="ComEstRes")
hist(msoa_matrix_numeric$ComEstRes)
By combining the distribution of commercial residents by MSOA with our PDP, we can see that those MSOAs that are most densely populated by commercial residents see the smallest “covid decrease”(suggesting that those MSOAs that are very heavily residential saw the sharpest drops).
plot(pdp, variables="House.Prices.Sales.2008")
Finally, we can see that those areas that experienced the highest volume of house sales experienced the lowest relative decrease in burglary.
This highlights the importance of identifying correlates in RF models - especially in a dataset where features are highly interlinked, association does not imply causation, and we should be wary of over-interpreting.
We can now repeat our process for our robbery shift.
sample = sample.split(data$RPDRobberyShifted, SplitRatio = 0.75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
rf_robbery <- randomForest(
RPDRobberyShifted ~ .,
data=train,
importance=TRUE
)
summary(rf_robbery)
Length Class Mode
call -none- character
predicted 734 -none- numeric
rsq 500 -none- numeric
importance 420 -none- numeric
localImportance 0 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 734 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
We’ve now trained a model. Let’s now use the DALEX library to understand it, and see how it performs.
rf_explainer_robbery <- DALEX::explain(rf_robbery, data=train, y= train$RPDRobberyShifted)
Preparation of a new explainer is initiated
-> model label : randomForest ( default )
-> data : 734 rows 211 cols
-> data : tibble converted into a data.frame
-> target variable : 734 values
yhat.randomForest will be used ( default )
-> predicted values : No value for predict function target column. ( default )
4.6.14 , task regression ( )
-> predicted values : numerical, min = -0.9645244 , mean = -0.0470117 , max = 0.3797893
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -0.4903862 , mean = -0.001156124 , max = 0.2960573
rf_perf_rob <- model_performance(rf_explainer_robbery)
rf_perf_rob
Measures for:
r2 : 0.8666274
mad : 0.02860048
Residuals:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
-0.490386240 -0.062713767 -0.038399387 -0.024521257 -0.013158676 -0.004124880 0.004984989 0.019451443 0.035391571 0.063885825 0.296057263
model_parts_rob <-model_parts(rf_explainer_robbery)
model_parts_rob
variable mean_dropout_loss label
1 _full_model_ 0.05952187 randomForest
2 RPDRobberyShifted 0.05952187 randomForest
3 Road.Casualties.2012.Fatal 0.05956864 randomForest
4 Road.Casualties.2010.Fatal 0.05958931 randomForest
5 Road.Casualties.2011.Fatal 0.05959538 randomForest
6 HholdRes 0.05970861 randomForest
7 Mid.year.Estimate.totals.All.Ages.2010 0.05973753 randomForest
8 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household.... 0.05975769 randomForest
9 Mid.year.Estimate.totals.All.Ages.2011 0.05976137 randomForest
10 Health..2011.Census..Day.to.day.activities.not.limited 0.05977415 randomForest
11 AvHholdSz 0.05977911 randomForest
12 Age.Structure..2011.Census..All.Ages 0.05978227 randomForest
13 UsualRes 0.05979675 randomForest
14 Lone.Parents..2011.Census..All.lone.parent.housholds.with.dependent.children 0.05980643 randomForest
15 Mid.year.Estimate.totals.All.Ages.2006 0.05981345 randomForest
16 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household 0.05986355 randomForest
17 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident 0.05986499 randomForest
18 Tenure..2011..Owned..Owned.outright 0.05986769 randomForest 0.05988390 randomForest 0.05988822 randomForest
23 Road.Casualties.2010.Serious 0.05988899 randomForest
24 Adults.in.Employment..2011.Census....of.households.with.no.adults.in.employment..With.dependent.children 0.05988994 randomForest
25 House.Prices.Median.House.Price.....2009 0.05989951 randomForest
26 Economic.Activity..2011.Census..Economically.active..Total 0.05990367 randomForest
27 Religion..2011..Jewish.... 0.05991359 randomForest
28 Household.Language..2011..At.least.one.person.aged.16.and.over.in.household.has.English.as.a.main.language 0.05991422 randomForest
29 Dwelling.type..2011..Whole.house.or.bungalow..Detached.... 0.05991695 randomForest
30 Dwelling.type..2011..Whole.house.or.bungalow..Detached 0.05991703 randomForest
31 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached.... 0.05992179 randomForest
32 House.Prices.Median.House.Price.....2010 0.05992630 randomForest
33 Mid.year.Estimates.2012..by.age.70.74 0.05993161 randomForest
34 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household 0.05993504 randomForest
35 randomForest
37 Mid.year.Estimate.totals.All.Ages.2007 0.05995053 randomForest
38 Mid.year.Estimate.totals.All.Ages.2004 0.05996353 randomForest
39 Mid.year.Estimate.totals.All.Ages.2012 0.05996873 randomForest
40 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached 0.05997034 randomForest
41 Age.Structure..2011.Census..65. 0.05997184 randomForest
42 Qualifications..2011.Census..Highest.level.of.qualification..Level.2.qualifications 0.05997500 randomForest
43 Religion..2011..Other.religion.... 0.05997794 randomForest
44 Health..2011.Census..Very.good.health 0.05997835 randomForest
45 House.Prices.Median.House.Price.....2005 0.05998924 randomForest
46 Tenure..2011..Social.rented.... 0.05999127 randomForest
47 Age.Structure..2011.Census..Working.age 0.05999621 randomForest
48 Mid.year.Estimates.2012..by.age...65. 0.06001570 randomForest
49 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household
52 House.Prices.Median.House.Price.....2011 0.06004060 randomForest
53 Household.Composition..2011..Numbers.Lone.parent.household 0.06004582 randomForest
54 Obesity.Percentage.of.the.population.aged.16..with.a.BMI.of.30...modelled.estimate..2006.2008 0.06004786 randomForest
55 Car.or.van.availability..2011.Census..Sum.of.all.cars.or.vans.in.the.area 0.06004791 randomForest
56 Mid.year.Estimates.2012..by.age.80.84 0.06004946 randomForest
57 Mid.year.Estimates.2012..by.age.50.54 0.06005044 randomForest 0.06005591 randomForest
60 Age.Structure..2011.Census..30.44 0.06005931 randomForest
61 Car.or.van.availability..2011.Census..1.car.or.van.in.household 0.06006814 randomForest
62 Adults.in.Employment..2011.Census..No.adults.in.employment.in.household..With.dependent.children 0.06007092 randomForest
63 Households..2011..All.Households 0.06007556 randomForest
64 House.Prices.Median.House.Price.....2007 0.06007826 randomForest
65 Economic.Activity..2011.Census..Economically.inactive.. 0.06008955 randomForest
66 Household.Income.Estimates..2011.12..Total.Mean.Annual.Household.Income.... 0.06009647 randomForest randomForest
69 Qualifications..2011.Census..No.qualifications 0.06010873 randomForest
70 Lone.Parents..2011.Census..Lone.parents.not.in.employment 0.06011036 randomForest
71 Age.Structure..2011.Census..0.15 0.06011788 randomForest
72 Household.Composition..2011..Percentages.Lone.parent.household 0.06011867 randomForest
73 Mid.year.Estimates.2012..by.age.40.44 0.06012597 randomForest
74 House.Prices.Sales.2011...129 0.06012701 randomForest
75 Income.Deprivation..2010....living.in.income.deprived.households.reliant.on.means.tested.benefit 0.06012765 randomForest
76 Health..2011.Census..Bad.health.... 0.06013151 randomForest
77 House.Prices.Median.House.Price.....2013..p. 0.06013614 0.06016171 randomForest
80 Country.of.Birth..2011..United.Kingdom 0.06016239 randomForest
81 House.Prices.Sales.2013.p. 0.06016255 randomForest 0.06017141 randomForest
84 Economic.Activity..2011.Census..Unemployment.Rate 0.06017764 randomForest
85 Household.Composition..2011..Percentages.Couple.household.without.dependent.children 0.06019165 randomForest
86 House.Prices.Sales.2010 0.06019170 randomForest
87 House.Prices.Sales.2005 0.06019370 randomForest
88 House.Prices.Sales.2006 0.06019433 randomForest
89 Mid.year.Estimates.2012..by.age.55.59 0.06019794 randomForest
90 Health..2011.Census..Day.to.day.activities.limited.a.little.... 0.06019800 randomForest
91 Economic.Activity..2011.Census..Economically.active.. 0.06020157 randomForest
92 Tenure..2011..Owned..Owned.outright.... 0.06020697 randomForest
93 Mid.year.Estimates.2012..by.age.30.34 0.06020928 randomForest
94 House.Prices.Median.House.Price.....2012 0.06020958 randomForest
95 Mid.year.Estimates.2012..by.age.45.49 0.06022425 randomForest
96 Central.Heating..2011.Census..Households.with.central.heating.... 0.06023901 randomForest
97 Religion..2011..Jewish 0.06024087 randomForest
98 Health..2011.Census..Very.bad.health.... 0.06024264 randomForest
99 Household.Composition..2011..Numbers.Couple.household.with.dependent.children 0.06025090 randomForest
100 Mid.year.Estimates.2012..by.age.75.79 0.06025885 randomForest
101 Mid.year.Estimate.totals.All.Ages.2002 0.06027020 randomForest
102 Household.Income.Estimates..2011.12..Total.Median.Annual.Household.Income.... 0.06027310 randomForest
103 Household.Composition..2011..Numbers.Couple.household.without.dependent.children 0.06027733 randomForest
104 Qualifications..2011.Census..Highest.level.of.qualification..Level.1.qualifications 0.06027961 randomForest
105 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident.... 0.06029249 randomForest
106 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan.... 0.06029446 randomForest
107 Dwelling.type..2011..Household.spaces.with.no.usual.residents 0.06031695 randomForest
108 Religion..2011..No.religion 0.06031908 randomForest
109 randomForest 0.06032636 randomForest
113 Health..2011.Census..Day.to.day.activities.limited.a.lot.... 0.06032787 randomForest
114 Mid.year.Estimates.2012..by.age.5.9 0.06032920 randomForest
115 Road.Casualties.2012.Serious 0.06033966 randomForest
116 Health..2011.Census..Good.health 0.06034929 randomForest
117 Religion..2011..No.religion.... 0.06036163 randomForest
118 Income.Deprivation..2010....of.people.aged.over.60.who.live.in.pension.credit.households 0.06036686 randomForest
119 Religion..2011..Hindu.... 0.06036907 randomForest
120 Dwelling.type..2011..Flat..maisonette.or.apartment 0.06037343 randomForest
121 Health..2011.Census..Fair.health.... 0.06037424 randomForest
122 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace. 0.06037583 randomForest
123 Lone.Parents..2011.Census..Lone.parent.not.in.employment.. 0.06037930 0.06038862 randomForest
126 Economic.Activity..2011.Census..Economically.active..Unemployed 0.06039167 randomForest
127 House.Prices.Median.House.Price.....2008 0.06039358 randomForest
128 Religion..2011..Christian.... 0.06039596 randomForest
129 Mid.year.Estimates.2012..by.age.20.24 0.06039727 randomForest
130 Religion..2011..Other.religion 0.06039886 randomForest
131 ComEstRes 0.06040057 randomForest
132 Household.Composition..2011..Percentages.Other.household.Types 0.06040106 randomForest
133 House.Prices.Sales.2007
135 Household.Composition..2011..Numbers.Other.household.Types 0.06040495 randomForest
138 Dwelling.type..2011..Household.spaces.with.no.usual.residents.... 0.06044575 randomForest
139 Religion..2011..Muslim.... 0.06045004 randomForest
140 Health..2011.Census..Day.to.day.activities.limited.a.little 0.06045210 randomForest
141 Tenure..2011..Private.rented.... 0.06045264 randomForest
142 Ethnic.Group..2011.Census..White 0.06045855 Low.Birth.Weight.Births..2007.2011..LCL...Lower.confidence.limit 0.06046500 randomForest
145 Dwelling.type..2011..Flat..maisonette.or.apartment.... 0.06046511 randomForest
146 House.Prices.Sales.2008 0.06046573 randomForest
147 Religion..2011..Christian 0.06046620 randomForest
148 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household.... 0.06046913 randomForest
149 Ethnic.Group..2011.Census..Asian.Asian.British 0.06048039 randomForest
150 Religion..2011..Hindu 0.06049182 randomForest
151 Household.Composition..2011..Percentages.One.person.household 0.06049665 randomForest
152 House.Prices.Sales.2009 0.06049784 randomForest
153 House.Prices.Sales.2011...130 0.06049885 Health..2011.Census..Very.good.health.... 0.06051842 randomForest
156 Low.Birth.Weight.Births..2007.2011..Low.Birth.Weight.Births.... 0.06052278 randomForest
157 Health..2011.Census..Very.bad.health 0.06052527 randomForest
158 Health..2011.Census..Fair.health 0.06052537 randomForest
159 randomForest
161 Ethnic.Group..2011.Census..Asian.Asian.British.... 0.06054116 randomForest
162 Low.Birth.Weight.Births..2007.2011..UCL...Upper.confidence.limit 0.06055614 randomForest
163 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British 0.06056211 randomForest
164 Qualifications..2011.Census..Highest.level.of.qualification..Other.qualifications 0.06056219 randomForest
165 Mid.year.Estimates.2012..by.age.0.4 0.06056244 randomForest
166 Population.Density.Persons.per.hectare..2012. 0.06057661 randomForest
167 Religion..2011..Religion.not.stated.... 0.06058630 randomForest
168 Religion..2011..Religion.not.stated 0.06058679 randomForest
169 Religion..2011..Sikh 0.06059155 randomForest
170 Incidence.of.Cancer.Breast.Cancer 0.06062165 randomForest
171 Country.of.Birth..2011..United.Kingdom.... 0.06062725 randomForest
172 Ethnic.Group..2011.Census..BAME 0.06066563 randomForest
173 Country.of.Birth..2011..Not.United.Kingdom 0.06068361 randomForest
174 Mid.year.Estimates.2012..by.age...0.to.14 0.06068807 randomForest
175 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan 0.06068995 randomForest
176 Ethnic.Group..2011.Census..Mixed.multiple.ethnic.groups 0.06069163 randomForest
177 Qualifications..2011.Census..Highest.level.of.qualification..Level.3.qualifications 0.06072193 randomForest
178 Car.or.van.availability..2011.Census..1.car.or.van.in.household.... 0.06074983 randomForest
179 Mid.year.Estimates.2012..by.age.15.19 0.06075380 randomForest
180 Household.Language..2011....of.people.aged.16.and.over.in.household.have.English.as.a.main.language 0.06075892 randomForest
181 Health..2011.Census..Bad.health 0.06076332 randomForest
182 Health..2011.Census..Day.to.day.activities.limited.a.lot 0.06077496 randomForest
183 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British.... 0.06082769 randomForest
184 Household.Language..2011..No.people.in.household.have.English.as.a.main.language 0.06085115 randomForest
185 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household 0.06085605 randomForest
186 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household.... 0.06087480 randomForest
187 Health..2011.Census..Good.health.... 0.06088495 randomForest
188 Household.Language..2011....of.households.where.no.people.in.household.have.English.as.a.main.language 0.06090083 Ethnic.Group..2011.Census..White.... 0.06100824 randomForest
191 Life.Expectancy.Females 0.06113049 randomForest
192 Economic.Activity..2011.Census..Economically.inactive..Total 0.06114768 randomForest
193 Age.Structure..2011.Census..16.29 0.06116905 randomForest
194 Ethnic.Group..2011.Census..BAME.... 0.06120881 randomForest
195 Life.Expectancy.Males 0.06122637 randomForest
196
198 Incidence.of.Cancer.All 0.06157997 randomForest
199 RPDburglaryShifted
201 Road.Casualties.2012.2012.Total 0.06221397 randomForest
202 Religion..2011..Buddhist.... 0.06230384 randomForest 0.06238647 randomForest
205 Road.Casualties.2011.2011.Total 0.06269966 randomForest
206 Road.Casualties.2010.2010.Total 0.06275241 randomForest
207 Religion..2011..Buddhist 0.06309183 randomForest
208 Ethnic.Group..2011.Census..Other.ethnic.group 0.06343477 randomForest
209 Road.Casualties.2010.Slight 0.06363878 randomForest
210 Ethnic.Group..2011.Census..Other.ethnic.group.... 0.06459224 randomForest
211 total_burglaries 0.06882403 randomForest
212 total_robberies randomForest
plot(model_parts_rob, max_vars=25)
pdp_rob <- model_profile(rf_explainer_robbery)
plot(pdp_rob, variables="total_robberies")
plot(pdp_rob, variables="total_burglaries")
The effect of historic crime effects again appears like a reliable predictor of a robbery covid shift: those MSOAs with the highest number of burglaries and robberies see strong decreases in robbery (though the association with burglary is not clear cut, suggesting other interaction effects may be driving this)
plot(pdp_rob, variables="Road.Casualties.2011.2011.Total")
Road casualties is another strong relationship. This could be a proxy for deprivation, but I suggest this is more down to geographic features - road casualties are probably rarer in denser urban environments, most likely to be affected by lockdown.
plot(pdp_rob, variables="Ethnic.Group..2011.Census..Other.ethnic.group....")
hist(data$Ethnic.Group..2011.Census..Other.ethnic.group....)
Finally, we see a demographic predictor linked to “other ethnic group”. I’m not clear how to interpret this, but it suggests that those MSOAs that are most diverse (and have the largest representation by these ethnic groups) experienced the strongest decreases in robbery during lockdown.
Together, these analyses suggest that the crime drop for robbery and burglary during national lockdown was significantly affected by distinct local factors. For both offences, historical crime trends play a role, with high crime areas experiencing a relatively larger drop in both burglary and robbery.
Beyond that, the drivers vary for each offence type: burglary was driven by the composition of the residential population, with heavily residential areas, and areas with a relatively “stable” population as measured by low housing sales also saw larger decrease - this is likely due to the increased number of empty residential properties.
Robbery decreases conversely, are well associated with high numbers of road casualties, as well as the ethnic makeup of the local population - this is likely due to an association with denser, more urban areas, with lower street speeds, and a possible link to deprivation, whereby minority population were least able to work from home, and were likely to still present as available targets for robbery.